In this notebook, we will explore our preprocessed dataset. We will analyze the data for outliers and correlated features, and weigh each feature's merits against the features our models select. We will use several feature selection methods, including PCA, forward/backward selection, and Decision Trees, and determine which features we can exclude based on the results and error rate. We want to train our models only on the features most important to determining a song's popularity per genre (with the exception of the Decision Tree, since it performs its own feature selection).

We also use kMeans to explore how well it can predict a song's popularity based on its features, and perhaps even cluster songs into genres based on a track's features.

This data has been preprocessed and does not contain null values.

Analyze data for outliers

We have a total of 17,473 track objects, 5,654 of which are our 'popular' (aka target) tracks. The number of tracks per genre is distributed within a few standard deviations of one another. The average number of tracks per genre is 4,625. The pop genre has the lowest count at 4,158 tracks; the r&b genre has the highest at 5,233 tracks.

Analyze data for correlated features

We didn't find many features significantly correlated with "popularity" other than "chartrank", but "chartrank" is expected to be correlated and therefore not insightful in this case.

Some stronger correlations to note:

We notice the ("popularity", "chartrank") & ("energy", "loudness") correlations are present in every genre, so we will omit further mention of them in the per-genre analysis below.

Jazz genre correlations: "valence" & "danceability" (somewhat); "valence" & "energy"; "energy" & "acousticness"; "loudness" & "acousticness".

Pop genre correlations: "energy" & "acousticness".

Country genre correlations: "valence" & "energy"; "energy" & "acousticness"; "loudness" & "acousticness".

Classification

Use Feature Selection methods to determine which features are most important to determining popularity of the song per genre

Since we know chart rank is correlated with popularity, and we want to base our predictions on song features, we will drop the 'chartrank' column from our training data.

Training data will consist of songs prior to 2019, a total of 20,618 tracks. Test data will consist of songs in 2019 & 2020, summing up to 2,509 tracks. Roughly 12% of our raw data is for testing. Our target will be predicting 'popular' or 'unpopular'.
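A minimal sketch of this split, assuming hypothetical column names ('year', 'chartrank', 'popular') that may differ from our actual schema:

```python
import pandas as pd

def split_by_year(df, cutoff=2019):
    """Drop the leaky 'chartrank' column and split train/test by year.
    Column names here are assumptions, not necessarily the real schema."""
    df = df.drop(columns=["chartrank"])
    train = df[df["year"] < cutoff]
    test = df[df["year"] >= cutoff]
    return (train.drop(columns=["popular"]), train["popular"],
            test.drop(columns=["popular"]), test["popular"])
```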

Multiple methods will be implemented to find the parameters that yield the highest accuracy per model, and to visualize the cross-validation score at a more granular, per-parameter level.

Decision Trees

One of the easiest and most efficient ways to perform feature selection is with Decision Trees. We will start by analyzing all the features for all the tracks. Then, we will test different parameters and use those that yield the highest accuracy in our model. Finally, we will see if we can improve our accuracy by repeating this process at a more granular level - separating tracks by genre.

Though we were able to get the best parameters all at once, let's plot one of those parameter ranges to

  1. better understand the reasoning behind our model's selection; and
  2. confirm there aren't better parameters to choose from.

For now, let's plot the cross-validation against the max depth parameter range.

Later we'll plot the other 2 parameter ranges: min samples split & min samples leaf.
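The max-depth sweep can be sketched as follows; the synthetic data stands in for our track features, so the scores are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the track features; scores are illustrative only.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Mean 5-fold cross-validation accuracy for each candidate max_depth.
scores = {d: cross_val_score(DecisionTreeClassifier(max_depth=d, random_state=0),
                             X, y, cv=5).mean()
          for d in range(1, 11)}
best_depth = max(scores, key=scores.get)
```

Plotting `scores` against the depth range gives the validation curve described above.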

The plot suggests that a max depth of 1 is sufficient for returning the highest accuracy. We need to find out which feature the DT is splitting on and analyze its merits. Let's create a Decision Tree model with the suggested min_samples_split & min_samples_leaf values, and set max_depth at the 3rd-highest accuracy rate to get more variance. Then, we'll analyze the classification report to determine how well our model is doing in both precision and recall using those "best" parameters.
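A rough sketch of this step; the parameter values below are placeholders, not the tuned values from our grid search, and the data is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Placeholder parameter values -- not the tuned values from the grid search.
dt = DecisionTreeClassifier(max_depth=3, min_samples_split=10,
                            min_samples_leaf=5, random_state=0).fit(X_tr, y_tr)
report = classification_report(y_te, dt.predict(X_te), output_dict=True)
```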

As we can see from our low recall score for the 'popular' class, our decision tree model performs worse than random guessing at predicting popular tracks. The 71% accuracy stems from its better ability to predict unpopular tracks.

Let's see which features were used in our Decision Tree model.

Based on all the genres and all the features, our Decision Tree selected 'loudness' as the most informational feature to split on. In the 2nd level, it split on 'acousticness' and 'year'. We can dismiss the 'year' feature since our data aims to predict popularity based on audio features. At the 3rd level, it splits on 'explicit_1', 'danceability', and 'loudness' again.
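Inspecting which features the tree splits on can be sketched via its feature importances; the feature names below are invented stand-ins for the real audio-feature columns:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Invented stand-ins for the real audio-feature column names.
feature_names = ["loudness", "acousticness", "danceability", "valence", "energy"]
X, y = make_classification(n_samples=400, n_features=5, random_state=0)

dt = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
importances = (pd.Series(dt.feature_importances_, index=feature_names)
               .sort_values(ascending=False))
```

`sklearn.tree.plot_tree(dt, feature_names=feature_names)` would additionally show which feature is used at each level of the tree.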

Using these best params, we can next test on max_features and assign a min_impurity_decrease value to help prune off features that don't return more information.

Let's see if our Decision Tree model improves if we fit it on a per-genre basis. We will re-evaluate the previous parameters in order to better fit our smaller datasets. First, we'll observe each parameter range individually for cross-validation scores. Then we'll test on all the parameters to get the best combination of parameters.

Overall, our accuracy only improved in the Jazz, Pop, & R&B genres compared to the DT model based on all audio features across all genres. Only in the R&B genre did we improve our recall score, which was still no better than guessing.

Naive Bayes

Training on tracks across all genres for popularity

Training on tracks per genres for popularity prediction

Training on tracks per genres for genre prediction

In conclusion, out of all of our classification models, Naive Bayes scored the highest in terms of recall. For popularity prediction, it scored 57% recall on unpopular tracks across all genres & 64% on popular tracks. In predicting a track's genre, it scored high in the country and jazz genres, mediocre in r&b, and poorly in latin and pop. Averaging that out, the data we are passing into our models returns scores only slightly better than random guessing.

K Nearest Neighbors

Training on tracks across all genres for popularity

First, we'll predict 'popular' vs 'unpopular' on all the tracks for all genres using K Nearest Neighbor as our classifier with Euclidean distance as our distance metric. We will perform a grid search on the best 'k'.

Let's evaluate the best number of nearest neighbors to optimize our knn prediction. Scoring will be based on 'recall'.

scoring = 'recall'
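The grid search over k can be sketched as below, scored on recall; the data is synthetic, so the resulting k is illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Grid search over k with Euclidean distance, scored on recall.
grid = GridSearchCV(KNeighborsClassifier(metric="euclidean"),
                    param_grid={"n_neighbors": range(1, 21)},
                    scoring="recall", cv=5).fit(X, y)
best_k = grid.best_params_["n_neighbors"]
```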

The elbow is at roughly 2 neighbors. Let's evaluate our classifier's accuracy on its 2019-2020 popularity predictions based on audio features:

Though our f1-score is worse than the f1-score of our decision tree, our recall for predicted popularity is much higher!

Training on tracks per genres for genre prediction

scoring = 'accuracy'. GridSearchCV can't score on 'recall' because genre prediction is multiclass.

Next, let's see how well kNN can predict a track's genre. Again, we'll start with a grid search to find the best number of neighbors to predict on.

PCA

Training on tracks across all genres for popularity

Let's see if our accuracy scores improve by implementing feature selection using Principal Component Analysis (PCA).

Since PCA was not designed for categorical variables, we will only perform PCA on our continuous features.

scoring = 'recall'
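The PCA step might look like this sketch, assuming the continuous columns have already been separated out; retaining 95% of the variance is an illustrative choice, not necessarily the threshold we used:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Random stand-in for the continuous audio columns; categorical flags
# such as 'explicit_1' would be excluded before this step.
rng = np.random.default_rng(0)
X_continuous = rng.normal(size=(200, 6))

# Scale, then keep enough components to explain 95% of the variance.
pca_pipe = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_pca = pca_pipe.fit_transform(X_continuous)
```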

Let's investigate the best recall score on predicting popularity for the top 3 values of k (based on the cross-validation scores above).

Training on tracks for genre prediction

When testing on all features for popularity prediction, our best scores were

          precision    recall  f1-score   support

       0       0.73      0.71      0.72      1739
       1       0.39      0.41      0.40       770

accuracy                           0.62      2509


When testing on all features for genre prediction, our best scores were

          precision    recall  f1-score   support

 country       0.64      0.61      0.62       518
    jazz       0.78      0.82      0.80       453
   latin       0.52      0.47      0.49       513
     pop       0.37      0.24      0.29       450
     r&b       0.46      0.63      0.53       575

accuracy                           0.56      2509

When testing on the PCA-transformed feature set for popularity prediction, our best scores were

          precision    recall  f1-score   support

       0       0.71      0.82      0.76      1739
       1       0.39      0.26      0.31       770

accuracy                           0.65      2509

When testing on the PCA-transformed feature set for genre prediction, our best scores were

          precision    recall  f1-score   support

 country       0.37      0.39      0.38       518
    jazz       0.71      0.81      0.76       453
   latin       0.46      0.44      0.45       513
     pop       0.26      0.21      0.23       450
     r&b       0.35      0.36      0.35       575

accuracy                           0.44      2509


In conclusion, selecting features based on PCA mildly improved our accuracy score for popularity prediction, but genre-prediction accuracy dropped, and the recall scores for predicting both popularity and genre decreased on the PCA-transformed dataset, where it matters most.

Backward Selection

We had planned to use sklearn's SequentialFeatureSelector for forward & backward feature selection; however, we were unable to import it because it was introduced in scikit-learn version 0.24, and Anaconda's environment is running version 0.23.3 without an option to update.
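For reference, a sketch of what the backward selection would look like on scikit-learn >= 0.24; the estimator and target feature count are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Backward elimination from 8 features down to 4 (illustrative target count).
sfs = SequentialFeatureSelector(KNeighborsClassifier(), n_features_to_select=4,
                                direction="backward", cv=3).fit(X, y)
selected_mask = sfs.get_support()
```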

Cluster Analysis

DBSCAN

So far, our models have failed to return a satisfactory accuracy/recall score for predicting audio features that indicate a 'popular' track. Instead, we may have better luck detecting a genre based on audio features. We will use the density-based clustering method DBSCAN to predict a track's genre. DBSCAN is also useful for detecting outliers, which will be interesting to visualize if our results prove meaningful.

Fit on all features

We remove the 'year' feature since it should be irrelevant to genre prediction.
Our classification methods proved 'duration' to be more useful than we initially imagined, particularly for predicting popularity within the pop, latin, & jazz genres. This makes sense for the jazz genre, but we're not sure why it would also apply to latin & pop. Thus, we shall keep this feature in the data to see if it adds any more insight to our findings.

Code source below provided by scikit-learn's DBSCAN demo: https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#sphx-glr-auto-examples-cluster-plot-dbscan-py

Using our training data

min_samples = 10
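The epsilon sweep below is a sketch; synthetic data stands in for our scaled track features, so the cluster and noise counts will differ from those reported:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled track features.
rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(500, 6)))

results = {}
for eps in np.arange(0.1, 1.0, 0.1):
    labels = DBSCAN(eps=eps, min_samples=10).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks noise
    n_noise = int((labels == -1).sum())
    results[round(float(eps), 1)] = (n_clusters, n_noise)
```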

DBSCAN with epsilon value of 0.1 returned 0 clusters, with all points considered "noise points".

DBSCAN with epsilon value of 0.2 returned 54 clusters, with 14,118 considered "noise points".

DBSCAN with epsilon value of 0.3 returned 71 clusters, with 5,793 considered "noise points".

DBSCAN with epsilon value of 0.4 returned 76 clusters, with 2,339 considered "noise points".

DBSCAN with epsilon value of 0.5 returned 69 clusters, with 1,110 considered "noise points".

DBSCAN with epsilon value of 0.6 returned 70 clusters, with 708 considered "noise points".

DBSCAN with epsilon value of 0.7 returned 72 clusters, with 555 considered "noise points".

DBSCAN with epsilon value of 0.8 returned 72 clusters, with 492 considered "noise points".

DBSCAN with epsilon value of 0.9 returned 74 clusters, with 430 considered "noise points".

A small min_samples of 10 returned an average of 62 clusters across epsilon values between 0.1 and 0.9; excluding the degenerate 0.1 run (0 clusters), the cluster counts ranged from 54 to 76.

min_samples = 50

We know we have at least 3,000 track samples per genre, so let's increase our min_sample to 50. We chose 50 because at 100 we were getting errors.

As we see in the plot, our accuracy scores did not improve. The 100% accuracy at the final epsilon value of 2.0 occurred because all points were clustered into 1 cluster, making that score invalid. Epsilons > 2.0 were thus not calculated.

A lower epsilon value returned cluster counts closer to our target of 5 clusters; however, it considered all other points noise. This leads us to believe we may have too much data.

Tracks per genre in train sample:

Using our test data to simulate a smaller sample size

Next, we will test clustering on a smaller sample size - our test data. Our theory is that by reducing our sample size, we might get better completeness and homogeneity scores.
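The completeness, homogeneity, and v-measure scores are computed against the true genre labels; a sketch using blobs as stand-ins for the genre-labeled test tracks:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import completeness_score, homogeneity_score, v_measure_score

# Blobs stand in for the genre-labeled test tracks.
X, genres = make_blobs(n_samples=300, centers=5, random_state=0)
labels = DBSCAN(eps=0.7, min_samples=10).fit_predict(X)

homogeneity = homogeneity_score(genres, labels)
completeness = completeness_score(genres, labels)
v_measure = v_measure_score(genres, labels)
```

The v-measure is the harmonic mean of homogeneity and completeness, which is why we report it as the combined score.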

Tracks per genre in test sample:

min_samples = 50

An epsilon value of 0.7 gave us the best v-measure at 0.078 (a combination of homogeneity & completeness). However, that is still a poor score, and it estimated 13 clusters, whereas an epsilon of 0.5 gave us the closest number of clusters (7) to the true count of 5.

min_samples=100

Higher min_samples values gave us better results, but the results were still poor. Across all tests our accuracy rate was <10%. Cluster counts were either well below or well above our goal of 5. Minimizing our sample size did not make a significant impact on the results.

Fit on PCA transformed features

Next, we will test clustering on a smaller feature set - the PCA-transformed data. Our theory is that by reducing the number of features, we might get better completeness and homogeneity scores.

min_samples = 10

min_samples = 50

Using the PCA transformed dataset with density based clustering did not return significant improvements, but it at least consolidated around the recommended epsilon value of 0.2.

kMeans

We use kMeans to explore how well it can predict a song's genre based on its features.

Using all features
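A minimal kMeans sketch with k fixed at our 5 known genres; blobs stand in for the real tracks, so the score is illustrative only:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import v_measure_score

# Blobs stand in for the real tracks; k is fixed at the 5 known genres.
X, genres = make_blobs(n_samples=500, centers=5, random_state=0)
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
score = v_measure_score(genres, km.labels_)
```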

Cluster 2 seems the closest to any real cluster.

Let's see how well it'll predict on our test sample.

Using PCA dataset

Although we were able to give our kMeans algorithm the correct amount of genres (aka clusters) to predict on, we still find ourselves with low accuracy scores:

Fitting our model on the PCA-transformed data slightly improved our accuracy scores:

This only furthers our conclusion that one cannot predict a song's popularity nor its genre by the basic audio features given to us by Spotify.